In this chapter, the introduction to required background theory for this thesis is continued.
This chapter introduces \textbf{computer vision} (Hartley and Zisserman \cite{Hartley2004}).
A clear vocabulary and corresponding mathematical nomenclature will be established that will be used consistently in the remainder of this thesis.
Computer vision uses the notation of Hartley and Zisserman \cite{Hartley2004}.

Laser range scanners are often used for robot localization.
However, this sensor has a number of problems (e.g., limited range, deal with dynamic environments).
Vision has long been advertised as a solution.
A camera can make a robot perceive the world in a way similar to humans.
\textbf{Computer vision} is a field that includes methods for acquiring, processing, analyzing and understanding images.
Since a camera is the AR.Drone's only sensor capable of retrieving detailed information of the environment, computer vision techniques are essential when performing robot localization with the AR.Drone.
%In addition, vision techniques can be used to calibrate other sensors of a robot.

\section{Projective geometry}
\label{sec:background-projective-geometry}
Projective geometry describes the relation between a scene and how a picture of it is formed.
A projective transformation is used to map objects into a flat picture.
This transformation preserves straightness.
Multiple geometrical systems can be used to describe geometry.

\textbf{Euclidean geometry} is a popular mathematical system.
A point in Euclidean 3D space is represented by an unordered pair of real numbers, $(x, y, z)$.
However, this coordinate system is unable to represent vanishing points.
The \textit{Homogeneous coordinate system} is able to represent points at infinity.
Homogeneous coordinates have an additional coordinate.
Now, points are represented by \textit{equivalance classes}, where points are equivalent when they differ by a common multiple.
For example, $(x, y, z, 1)$ and $(2x, 2y, 2z, 2)$ represent the same point.
Points at infinity are represented using $(x, y, z, 0)$, because $(x/0, y/0, z/0)$ is infinite.
In Euclidean geometry, one point is picked out as the \textit{origin}.
A coordinate can be transformed to another coordinate by translating and rotating to a different position.
This operation is known as a \textbf{Euclidian transformation}.
A more general type of transformation is the \textbf{affine transformation}, which is a linear transformation (linear stretching), followed by a Euclidian transformation.
The resulting transformation performs moving (translating), rotating and
%finally
stretching linearly.
% possibly by different ratios in different directions.

In computer vision problems, projective space is used as a convenient way of representing the real 3D world.
A \textbf{projective transformation} of projective space $\mathds{P}^n$ is represented by a linear transformation of homogeneous coordinates:
\begin{equation}
X' = H_{(n+1)} \times _{(n+1)}X
\end{equation}
where $X$ is a coordinate vector, $H$ is a non-singular matrix and $X'$ is a vector with the transformed coordinates.

In order to analyze and manipulate images, understanding about the image formation process is required.
Light from a scene reaches the camera and passes through a lens before it reaches the sensor.
The relationship between the distance to an object $z$ and the distance behind the lens $e$ at which a focused image is formed, can be expressed as:
\begin{equation}
\label{eq:optics}
\frac{1}{f} = \frac{1}{z} + \frac{1}{e}
\end{equation}
where $f$ is the focal length.

The \textbf{pinhole camera} has been the first known example of a camera \cite{gernsheim1969history}.
%\footnote{\url{http://inventors.about.com/od/pstartinventions/a/stilphotography.htm}}
It is a lightproof box with a very small aperture instead of a lens.
Light from the scene passes through the aperture and projects an \textit{inverted image} on the opposite side of the box.
This principle has been adopted as a standard model for perspective cameras.
In this model, the \textit{optical center} $C$ (aperture) corresponds to the center of the lens.
The intersection $O$ between the optical axis and the image plane is called \textit{principal point}.
The pinhole model is commonly described with the image plane between $C$ and the scene in order to preserve the same orientation as the object.

The operation performed by the camera is called the \textbf{perspective transformation}.
It transforms the 3D coordinate of scene point $P$ to coordinate $p$ on the image plane.
A simplified version of the perspective transformation can be expressed as:
\begin{equation}
\frac{f}{P_z} = \frac{p_x}{P_x} = \frac{p_y}{P_y}
\end{equation}
from which $p_x$ and $p_y$ can be recovered:
\begin{equation}
p_x = \frac{f}{P_z} \cdot P_x
\end{equation}
\begin{equation}
p_y = \frac{f}{P_z} \cdot P_y
\end{equation}
When using homogeneous coordinates, a linear transformation is obtained:
\begin{equation}
\left[ \begin{array}{c}
\lambda p_x \\
\lambda p_y \\
\lambda
\end{array} \right]
=
\left[ \begin{array}{c}
f P_x \\
f P_y \\
P_z
\end{array} \right]
=
\left[ \begin{array}{c c c c}
f & 0 & 0 & 0  \\
0 & f & 0 & 0 \\
0 & 0 & 1 & 0
\end{array} \right]
\left[ \begin{array}{c}
P_x \\
P_y \\
P_z \\
1
\end{array} \right]
\end{equation}
From this equation can be seen that it is not possible to estimate the distance to a point (i.e., all points on a line project to a single point on the image plane).

A more realistic camera model is called the \textbf{general camera model}.
It takes into account the rigid body transformation between the camera and the scene, and \textit{pixelization} (i.e., shape and position of the camera's sensor).
Pixelization is addressed by replacing the focal length matrix with a \textit{camera intrinsic parameter matrix} $A$.
The rigid body transformation is addressed by adding a combined rotation and translation matrix $[R|t]$, which are called the \textit{camera extrinsic parameters}.
\begin{equation}
\lambda p = A [ R | t] P
\end{equation}
or
\begin{equation}
\left[ \begin{array}{c}
\lambda p_x \\
\lambda p_y \\
\lambda
\end{array} \right]
=
\left[ \begin{array}{c c c}
\alpha_u & 0 & u_0 \\
0 & \alpha_v & v_0 \\
0 & 0 & 1
\end{array} \right]
\left[ \begin{array}{c c c c}
r_{11} & r_{12} & r_{13} & t_1 \\
r_{21} & r_{22} & r_{23} & t_2 \\
r_{31} & r_{32} & r_{33} & t_3
\end{array} \right]
\left[ \begin{array}{c}
P_x \\
P_y \\
P_z \\
1
\end{array} \right]
\end{equation}

\textbf{Camera calibration} is the process of measuring the intrinsic and extrinsic parameters of the camera.
As explained in the previous paragraph, these parameters are required to calculate the mapping between 3D scene points to 2D points on the image plane.
When scene points $P$ and image points $p$ are known, it is possible to compute $A$, $R$, $t$ by solving the perspective projection equation.
Newer camera calibration techniques use a planar grid (e.g., chessboard-like pattern) instead of 3D points, to ease the extraction of corners.
The method requires several pictures of the pattern shown at different positions and orientations.
Because the 2D positions of the corners are known and matched between all images, the intrinsic and extrinsic parameters are determined simultaneously by applying a least-square minimization.

\section{Feature extraction and matching}
\label{sec:computer-vision-feature-extraction-matching}
In order to perform robot localization using vision sensors, the robot must be able to relate recent camera frames to previously received camera frames (map).
However, all measurements have an error, which complicates the matching.
A strategy to deal with uncertainty is by generating a higher-level description of the sensor data instead of using the raw sensor data.
This process is called \textit{feature extraction}.

\begin{mydef}
\textbf{Features} are recognizable structures of elements in the environment.
They are extracted from measurements and described in a mathematical way.
\textit{Low-level features} are geometric primitives (e.g., lines, points, corners) and \textit{high-level features} are objects (e.g., doors, tables).
\end{mydef}

Features play an essential role %
in the software of mobile robots.
They enable more compact and robust descriptions of the environment.
Choosing the appropriate features is a critical task when designing a mobile robot.

\subsection{Feature extraction}
\label{sec:background-feature-extraction}
A local feature is a small image patch that is different than its immediate neighborhood.
Difference can be in terms of intensity, color and texture.
Commonly, features are edges, corners or junctions.
The most important aspect of a feature detector is \textit{repeatability}, which means the detector is able to detect the same features when multiple images (with different viewing and illumination conditions) of the same scene are used.
Another important aspect of a feature detector is \textit{distinctiveness}, which means the information carried by a patch is distinctive as possible.
This is important for robust feature matching.
A good feature is \textit{invariant}, meaning that changes in camera viewpoint, illumination, rotation and scale do not affect a feature and its high-level description.

\textbf{Scale-invariant feature transform (SIFT)} \cite{lowe1999object} is probably the most popular feature extractor.
It is invariant to scale, rotation, illumination and viewpoint.
The algorithm can be outlined as follows.
First, an internal representation of an image is computed to ensure scale invariance.
Secondly, an approximation of the Laplacian of Gaussian is used to find interesting keypoints.
Third, keypoints are found at maxima and minima in the Difference of Gaussian from the previous step.
Fourth, bad keypoints (e.g., edges and low contrast regions) are eliminated.
Fifth, an orientation is calculated for each keypoint.
Any further calculations are done relative to this orientation, making it rotation invariant.
Finally, a high-level representation is calculated to uniquely identify features.
This \textit{feature descriptor} is a vector of 128 elements, containing orientation histograms.

For some applications or mobile platforms, the SIFT extractor is too slow.
Therefore, faster variants are proposed, like \textbf{Speeded Up Robust Feature (SURF)} \cite{Bay2008cviu}.
%SURF is partly inspired by the SIFT descriptor.
The standard version of SURF is several times faster than SIFT and claimed by its authors to be more robust against different image transformations than SIFT.
The important speed gain is achieved by using integral images, which drastically reduces the number of operations for simple box convolutions, independent of the chosen scale.

The algorithm can be outlined as follows.
Initially, an integral image and Hessian matrix \cite{van1994complex} are computed.
The Hessian matrix is a square matrix of second-order partial derivatives of a function, describing the local curvature of a function.
Blob-like structures are detected at locations where the determinant matrix is maximum.
The images are repeatedly smoothed with a Gaussian kernel and then sub-sampled in order to create a pyramid of different resolutions.
In order to find interest points in the image and over scales, a non-maximum suppression in a 
%$3 \times 3 \times 3$
neighborhood is applied. 
The maxima of the determinant of the Hessian matrix are then interpolated in scale and image space.
A descriptor is constructed that describes the distribution of the intensity content within the interest point neighborhood.
First-order Haar wavelet \cite{stollnitz1995wavelets} responses in $x$ and $y$ direction are used rather than the gradient, exploiting integral images for speed. The size of the descriptor is reduced to $64$ dimensions.

Another feature extractor is the \textbf{GIST} descriptor \cite{oliva2001modeling}, which characterizes several important statistics about a scene.
The GIST descriptor measures the global distribution of oriented line segments in an image.
This makes the descriptor well suited for determining the type of environment.
The GIST feature is computed by convolving an oriented filter with the image at several different orientations and scales. 
The scores of the convolution at each orientation and scale are stored in array.

\subsection{Feature matching}
\label{sec:theory_feature_matching}
To perform robot localization, extracted feature descriptors need to be matched against the feature descriptors that are part of the map.
Ideally, each feature extracted from the current frame is correlated against the corresponding feature inside the map.

A \textit{matching function} is used to compute the correspondence between two feature descriptors.
The matching function depends on the type of feature descriptor.
In general, a feature descriptor is a high-dimensional vector and matching features can be found by computing the distance using the $L_2$ norm, which is defined as:
\begin{equation}
L2(a,b) =\sqrt { \sum_{i=1}^{N} \left| a_i - b_i \right| ^2 }
\end{equation}
where $N$ is the dimensionality of the descriptor vector.

The matching function is used to find matches between two sets of descriptors.
A popular matcher is the \textit{Brute-force descriptor matcher}. For each descriptor in the first set, this matcher finds the closest descriptor in the second set by trying each one.
However, the brute-force matcher becomes slow for large descriptors sets, due to the complexity of $O(n^2)$.
For large descriptor sets, the \textit{Fast Library for Approximate Nearest Neighbors (FLANN)} \cite{muja2009flann} matcher performs more efficient matching.
It uses an Approximate Nearest Neighbors algorithm to find similar descriptors efficiently.
The authors state that the approximation has proven to be good-enough in most practical applications.
In most cases it is orders of magnitude faster than performing exact searches.

In this thesis, the Brute-force descriptor matcher is used.
The estimated pose of the vehicle is used to select a part of the map (descriptor set), which reduces the size of the descriptor set.
Furthermore, the resolution of the map (descriptor set) is reduced by keeping only the best feature for a certain area inside the map.
For such limited descriptor sets, the Brute-force descriptor matcher performs quite well.


\subsection{Popular application: image stitching}
A popular computer vision application that heavily relies on features is \textit{image stitching}, which is the process of combining multiple photos with overlapping fields of view.
The result is a segmented panorama or high-resolution image. 

The first step involves extracting features for all images.
Since most features are invariant under rotation and scale changes, images with varying orientation and zoom can be used for stitching.
%These features are matched to search for image alignments that minimize the sum of absolute differences between overlapping pixels.
Images that have a large number of matches between them are identified.
These images are likely to have some degree of overlap.
A homography $\boldsymbol{H}$ (projective transformation) is computed between matching image pairs.
This homography describes how the second image needs to be transformed in order to correctly overlap the first image.
Different types of homographies can be used to describe the transformation between image pairs, depending on the level of displacement between images.

The homography $\boldsymbol{H}$ cannot be computed from all feature matches directly.
False matches would reduce the accuracy of the estimated homography.
Instead, a robust estimation procedure is applied.
\textbf{RANSAC (Random Sample Consensus)} \cite{fischler1981random} is a robust estimation procedure that uses a minimal set of randomly sampled correspondences to estimate image transformation parameters, finding a solution that has the best consensus with the data.
Commonly, four random feature correspondences are used to compute the homography $\boldsymbol{H}$ of an image pair.
This step is repeated $N$ trials and the solution that has the maximum number of inliers (whose projections are consistent with $\boldsymbol{H}$ within a tolerance $\epsilon$ pixels) is selected.

When the homographies of all image pairs are estimated, a global optimization method can be applied to reduce drift or other sources of error.
Finally, the images are transformed according to the corresponding homographies, resulting in a seamless panorama or high-resolution image. 


\section{Visual Odometry}
\label{sec:background-visual-odometry}
An interesting combination of computer vision and mobile robotics is \textit{Visual Odometry (VO)}, which is the process of estimating the motion of a mobile robot using one or more onboard cameras.
Below,
%only
the case of a single camera and 2D feature coordinates is described.
There are two main approaches to compute the relative motion from video images: \textit{appearance-based} methods and \textit{feature-based} methods.
Appearance-based methods use intensity information, while feature-based methods use higher-level features (Section \ref{sec:computer-vision-feature-extraction-matching}).
In this thesis a feature-based approach was chosen, because high-level features are used for localization against a map.
Therefore, parts of the localization code can be reused for a Visual Odometry algorithm.
%Feature-based methods are more popular, because they are faster and more accurate than appearance-based methods.
%Therefore, this thesis focusses on feature-based methods and the remainder of this section describes Visual Odometry using the feature-based approach.

The set of images taken at time $t$ is denoted by $I_{0:n} = \{I_0, \hdots, I_n\}$.
Two camera positions at consecutive time instants $t-1$ and $t$ are related by a rigid body transformation:
\begin{equation}
\label{eq:computer-vision-visual-odometry-T}
T_t = T_{t,t-1} = 
\left[ \begin{array}{c c}
R_{t,t-1} & t_{t,t-1} \\
0 & 1
\end{array} \right]
\end{equation}
where $R_{t,t-1}$ is the rotation matrix and $t_{t,t-1}$ is the translation matrix.
The set $T_{1:n} = \{T_1, \hdots, T_n\}$ contains all subsequent motions.
The set of camera poses $C_{0:n} = \{C_0, \hdots, C_n\}$ contains the transformation of the camera with respect to the initial coordinate frame at time $t = 0$.

The objective of Visual Odometry is to recover the full trajectory of the camera $C_{0:n}$.
This is achieved by computing the relative transformation $T_t$ from images $I_t$ and $I_{t-1}$.
When all relative transformation are concatenated, the full trajectory of the camera $C_{0:n}$ is recovered.
An iterative refinement procedure like \textit{windowed-bundle adjustment} \cite{engels2006bundle} can be performed to obtain a more accurate estimate of the local trajectory.

The first step of VO is the extraction of image features $f_t$ from a new image $I_t$.
In the second step, these features are matched with features $f_{t-1}$ from the previous image (multiple images can be used to recover 3D motion).
The third step consists of computing the relative motion $T_t$ between time $t-1$ and $t$.
For accurate motion estimation, feature correspondences should not contain outliers, as described in Section \ref{sec:theory_feature_matching}.

The geometric relations between two images $I_t$ and $I_{t-1}$ are described by the \textit{essential matrix} $E$:
\begin{equation}
E_t \simeq \hat{t}_t R_t
\end{equation}
where $R_t$ is the rotation and $\hat{t}_t$ is the translation matrix:
\begin{equation}
\hat{t}_t = 
\left[ \begin{array}{c c c}
0 & -t_z & t_y \\
t_z & 0 & -t_x \\
-t_y & t_x & 0
\end{array} \right]
\end{equation}
The rotation and translation of the camera can be extracted from $E$.
%At least five feature correspondences are required.
%An efficient implementation is proposed Nister \cite{nistér2004efficient}, which has become the standard for 2-D to 2-D motion estimation.
A solution when at least eight feature correspondences are available, is the Longuet-Higgins's eight-point algorithm \cite{longuet1987computer}.
Each feature correspondence gives a constraint of the following form:
\begin{equation}
\left[ \begin{array}{c c c c c c c c c}
\tilde{u}\tilde{u}' & \tilde{u}'\tilde{v} & \tilde{u}' & \tilde{u}\tilde{v}' & \tilde{v}\tilde{v}' & \tilde{v}' & \tilde{u}' & \tilde{v}' & 1
\end{array} \right]
E = 0
\end{equation}
where $E = [e_1  e_2  e_3  e_4  e_5  e_6  e_7  e_8  e_9]^T$, 
$\tilde{u}$ and $\tilde{v}$ are the $x$ and $y$ coordinates in the first image and $\tilde{u}'$ and $\tilde{v}'$ are the $x$ and $y$ coordinates in the second image.

Stacking these constraints gives the linear equation $A E = 0$, where $A$ is the camera intrinsic matrix.
This equation can be solved by using singular value decomposition (SVD) \cite{klema1980singular}, which has the form $A = USV^T$.
Now, the projected essential matrix $\overline{E}$ can be found as the last column of $V$:
\begin{equation}
\overline{E} = U diag\{1, 1, 0\} V^T
\end{equation}
The rotation and translation can be extracted from $\overline{E}$:
\begin{equation}
R = U (\pm W^T) V^T
\end{equation}
\begin{equation}
\hat{t} = U (\pm W) S U^T
\end{equation}
where
\begin{equation}
W^T =
\left[ \begin{array}{c c c}
0 & \pm 1 & 0 \\
\mp 1 & 0 & 0 \\
0 & 0 & 1
\end{array} \right]
\end{equation}
Now, the camera position at time $t$ can be recovered by concatenating the relative transformations:
\begin{equation}
C_t = \prod_{k=t}^1T_{k}
\end{equation}
%Additional computations can be performed in order to retrieve the absolute scale of the translations.

The theory described in this chapter is used in the work presented in Section \ref{chapter:visual-slam}.
Projective geometry is used in Section \ref{sec:mapping} to warp the camera images on a map.
Furthermore, feature extraction is used to build a map.
Future matching is used in Section \ref{sec:localization} to perform localization against a map.
Section \ref{sec:visual-slam-visual-odemetry} uses Visual Odometry to recover the motion of the AR.Drone.